This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Cleaning data

Overview

This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.

HOW TO READ: To emphasize the tidyverse coding approach, each cleaning step is explained individually and then incorporated into a “cleaning pipeline” - a series of cleaning actions linked together sequentially through pipes (LINK TO PIPES). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).

The cleaning steps demonstrated include:

  • Loading the data
  • Variable name cleaning
  • Variable selection
  • Designating variable classes
  • Filtering rows
  • Re-coding values
  • Creating groups (case_when())
  • Dealing with character case (upper, lower, title, etc.)
  • Factor variables

replace missing with dealing with cases (all lower, etc) case_when() factors

Preparation

Load packages

pacman::p_load(tidyverse,  # data manipulation and visualization
               janitor,    # data cleaning
               epitrix,    # data cleaning
               )

Load data

Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)

linelist_raw <- import("ebola_simulated.xlsx")

You can view the original raw dataset below:

# display the linelist data as a table
DT::datatable(linelist_raw, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

A cleaning pipeline

Text here about a cleaning pipeline of dplyr verbs… order is important

Variable names

Variable names are used so often, it is best that they have “clean” syntax. We suggest the following:

  • short
  • no spaces (replaced with underscores (_),
  • no unusual characters (&, #…)
  • similar nomenclature (e.g. all dates like date_onset, date_report, date_death)

The names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).

names(linelist_raw)
##  [1] "row_num"         "case_id"         "generation"      "infection date" 
##  [5] "date onset"      "hosp date"       "date_of_outcome" "outcome"        
##  [9] "gender"          "hospital"        "lon"             "lat"            
## [13] "infector"        "source"          "age"             "age_unit"       
## [17] "fever"           "chills"          "cough"           "aches"          
## [21] "vomit"
Note: To use a variable names that include spaces, surround the name with back-ticks, for example: linelist$`infection date` 
On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).

Automatic syntax cleaning

The function clean_names() from the package janitor standardizes variable names by transliterating to unique ASCII names by doing the following:

  • Converts all names to consist of only underscores, numbers, and letters
  • Accented characters are transliterated to ASCII (e.g. german o with umlaut becomes “o”, spanish “enye” becomes “n”)
  • Capitalization preference can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
  • You can designate specific replacements with the replace = argument (e.g. replace = c(onset = “date_of_onset”))
  • Here is an online vignette
# send the dataset through the function clean_names()
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new names
names(linelist)
##  [1] "row_num"         "case_id"         "generation"      "infection_date" 
##  [5] "date_onset"      "hosp_date"       "date_of_outcome" "outcome"        
##  [9] "gender"          "hospital"        "lon"             "lat"            
## [13] "infector"        "source"          "age"             "age_unit"       
## [17] "fever"           "chills"          "cough"           "aches"          
## [21] "vomit"

Manual variable name cleaning

Re-naming variables manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”, the new variable name is given before the old variable name.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize variable name syntax
    janitor::clean_names() %>% 
    
    # manually re-name variables
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the variables names have been changed:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Select variables

CAUTION: This tab may follow from previous tabs.

Often the first step of cleaning data is selecting the variables you want to work with, and to set their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.

Here are all the variable names in the linelist:

names(linelist)
##  [1] "row_num"              "case_id"              "generation"          
##  [4] "date_infection"       "date_onset"           "date_hospitalisation"
##  [7] "date_outcome"         "outcome"              "gender"              
## [10] "hospital"             "lon"                  "lat"                 
## [13] "infector"             "source"               "age"                 
## [16] "age_unit"             "fever"                "chills"              
## [19] "cough"                "aches"                "vomit"

With select() you can do the following:

Select only the variables you want to remain, and their order of appearance

# linelist dataset is piped through select() command, and names() prints just the variable names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names() # display the variable names
## [1] "case_id"              "date_onset"           "date_hospitalisation"
## [4] "fever"

Indicate which variables to remove by placing a minus symbol “-” in front of the variable name (e.g. select(-outcome)), or a vector of variable names (as below). All other variables will be retained.
Inside select() you can use normal operators such as c() to list several variables, : for consecutive variables, ! for opposite, & for AND, and | for OR.

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove onset and all symptom variables
  names()
##  [1] "row_num"              "case_id"              "generation"          
##  [4] "date_infection"       "date_hospitalisation" "date_outcome"        
##  [7] "outcome"              "gender"               "hospital"            
## [10] "lon"                  "lat"                  "infector"            
## [13] "source"               "age"                  "age_unit"

Re-order the variables - use everything() to signify all other variables not specified in the select() command:

# move case_id, date_onset, date_hospitalisation, and gender to beginning
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, gender, everything()) %>% 
  names()
##  [1] "case_id"              "date_onset"           "date_hospitalisation"
##  [4] "gender"               "row_num"              "generation"          
##  [7] "date_infection"       "date_outcome"         "outcome"             
## [10] "hospital"             "lon"                  "lat"                 
## [13] "infector"             "source"               "age"                 
## [16] "age_unit"             "fever"                "chills"              
## [19] "cough"                "aches"                "vomit"

As well as everything() there are several special functions that work within select(), namely:

  • everything() - all other variables not mentioned
  • last_col() - the last column
  • starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
  • ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
  • contains() - variables containing a character string. Example: select(contains("time"))
  • matches() - to apply a regular expression (regex). Example: select(contains("[pt]al"))
  • num_range() -
  • any_of() - matches if variable is named. Useful if name might not exist. Example: select(any_of(date_onset, date_death, cardiac_arrest))
  • where() - applies a function to all variables and selects those which are TRUE

Adding select()to the cleaning pipe chain:

In the linelist, there is one variable we do not need: row_num. Remove it by adding a select() command to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize variable name syntax
    janitor::clean_names() %>% 
    
    # manually re-name variables
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-row_num)

select() as a stand-alone command

Select can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.

# Create a new linelist with only three variables
linelist_age <- select(linelist, case_id, age, age_unit)

Fix classes

CAUTION: This tab may follow from previous tabs.

See section on object classes

Here we want to ensure that the class of each variable is appropriate. First we run some checks on the classes of important variables.

The class of the “age” variable is character. To perform analysis, we need those numbers to be recognized as numeric!

class(linelist$age)
## [1] "character"

The class of the “date_onset” variable is also character! To perform analysis, these dates must be recognized as dates!

class(linelist$date_onset)
## [1] "character"

Use table() or another method to see all the values, can see that we see that one date was entered in a different format (15 April 2014) than all the others!

## 
## 15 April 2014    2010-05-07    2010-05-08    2010-05-27    2010-06-15 
##             1             1             1             1             1 
##    2010-06-18 
##             1

This means before we can classify “date_onset” as a date, this value must be fixed to be the same format as the others. You can fix the date in the source data. Or, we can do this using mutate() and recode() in our cleaning pipe chain, before the commands to convert to class Date. LINK TO CLASSIFYING VARIABLE AS DATE.

The new mutate line can be read as: mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE. Note that this pattern (OLD = NEW) is the opposite of most R patterns. The R development community is working on revising this.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize variable name syntax
    janitor::clean_names() %>% 
    
    # manually re-name variables
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-row_num) %>% 

# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################
  # fix incorrect values                 # old value       # new value
  mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>% 
  
  # correct the class of the variables
  mutate(age           = as.numeric(age),
         date_onset    = as.Date(date_onset, format = "%Y-%m-%d"))

Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! For as.Date(), the format = argument is often a source of errors.

Fix class for multiple variables at once

class(linelist$date_infection)
## [1] "POSIXct" "POSIXt"
head(linelist$date_infection)
## [1] "2014-04-09 UTC" NA               NA               "2014-05-07 UTC"
## [5] NA               "2014-05-06 UTC"

You can use The dplyr function across() with mutate() to convert several variables at once to a new class. across() allows you to specify which columns you want a function to apply to. Below, we want to mutate the variables where is.POSIXct() (a type of date/time class that shows unnecessary timestamps) is TRUE, and apply the function is.Date() to them, in order to convert them to class “date”.

  • Note that within across() we also use the function where().
  • Note that is.POSIXct is from the package lubridate. Other similar functions (is.character(), is.numeric(), and is.logical()) are from base R
  • Note that the functions in across() are written without the empty parentheses ()
linelist <- linelist %>% 
  mutate(across(where(lubridate::is.POSIXct), as.Date))

Filter rows

CAUTION: This tab may follow from previous tabs.

After selecting columns, a typical cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept. A simple example below filters to keep only the rows where case_id is not missing (three rows are removed):

linelist <- linelist %>% 
  filter(!is.na(case_id))  # keep only rows where case_id is not missing

A more complex example:

Run a simple histogram of onset dates to see that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.

hist(linelist$date_onset, breaks = 50)

If we simply filter linelist by date of onset (after June 2013) we may make a mistake! Applying filter(date_onset > as.Date("2013-06-01"))) would accidentally remove any observations in the later epidemic with a missing date of onset!

DANGER: Filtering to greater than (>) or less than (<) a date can remove any observations with missing date values (NA)! This is because NA is treated as infinitely large and small.

We also know that this first epidemic occurred at Hospital A, Hospital B, and there were 10 cases at Connaught Hospital. Hospitals A & B did not have cases in the second epidemic, but Connaught Hospital had many. This is a complex filter to apply - it is wise to tabulate these variables to know exactly how many rows we expect should be removed.

Let’s examine a cross-tabulation to make sure we exclude only the correct rows:

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of the date_onset
      useNA     = "always")                              # show missing values
##                                               YearOnset
## Hospital                                       2010 2011 2014 2015 <NA>
##   Connaught Hopital                               0    0   38    6    3
##   Connaught Hospital                              8    2 1322  327   66
##   Hospital A                                     40   13    0    0    1
##   Hospital B                                     41   12    0    0    1
##   Military Hopital                                0    0   20    8    2
##   Military Hospital                               0    0  580  168   38
##   Mitylira Hopital                                0    0    1    0    0
##   Mitylira Hospital                               0    0   59   17    3
##   other                                           0    0  679  168   38
##   Princess Christian Maternity Hopital (PCMH)     0    0   10    1    0
##   Princess Christian Maternity Hospital (PCMH)    0    0  306   90   15
##   Rokupa Hopital                                  0    0   10    1    0
##   Rokupa Hospital                                 0    0  332   94   17
##   <NA>                                            0    0 1106  296   67

We want to exclude only the nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01"))) observations from 2012 and 2013 at those three hospitals (A, B, and Connaught), including the 2 from Hospitals A & B with missing onset dates, but not any others with missing onset dates. We start with a linelist of nrow(linelist). Here is our statement:

linelist <- linelist %>% 
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
## [1] 5888

When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, the 10 Connaught Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of the date_onset
      useNA     = "always")                              # show missing values
##                                               YearOnset
## Hospital                                       2014 2015 <NA>
##   Connaught Hopital                              38    6    3
##   Connaught Hospital                           1322  327   66
##   Military Hopital                               20    8    2
##   Military Hospital                             580  168   38
##   Mitylira Hopital                                1    0    0
##   Mitylira Hospital                              59   17    3
##   other                                         679  168   38
##   Princess Christian Maternity Hopital (PCMH)    10    1    0
##   Princess Christian Maternity Hospital (PCMH)  306   90   15
##   Rokupa Hopital                                 10    1    0
##   Rokupa Hospital                               332   94   17
##   <NA>                                         1106  296   67

Multiple filter statements can be separated by commas, or you can always pipe to a separate filter() statement for clarity. Adding these filter statements to the cleaning pipe chain now looks like this:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize variable name syntax
    janitor::clean_names() %>% 
    
    # manually re-name variables
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-row_num) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the variables
    mutate(age           = as.numeric(age),
           date_onset    = as.Date(date_onset, format = "%Y-%m-%d")) %>% 
    
  # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    filter(!is.na(case_id),  # keep only rows where case_id is not missing
           date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))
           ) # close filter

Standalone command

filter(dataset, criteria) OR subset like: dataset_new <- dataset[criteria,criteria]

nrow(linelist %>% filter())

Creating and modifying variables

CAUTION: This tab may follow from previous tabs.

We advise creating new variables with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use the base R style to create a new variable.

Using dplyr verbs

As explained in the section on dplyr and tidyverse coding style (LINK HERE), a chain of ‘verb’ functions operate on a dataset through ‘pipes’ (%>%), passing the output from one verb to the next. The verb mutate() used to add a new variable or modify an existing one.

Below are some example of creating new variables with mutate(). The syntax is: new_variable_name = value or function. It is best practice to separate each new variable with a comma and new line.

linelist <- linelist %>%                   # creating new, or modifying old dataset
  mutate(new_var_dup    = case_id,                  # new variable = duplicate/copy another variable
         new_var_static = 7,                   # new variable = all values the same
         new_var_static = new_var_static + 5,  # you can overwrite a variable, and can modify a variable multiple times
         new_var_calc   = (age / 12),          # new variable = a calculation
         new_var_paste  = paste0(hospital, " (", date_hospitalisation, ")") # new variable = pasting together values from other variables
         ) 

Scroll to the right to see the new variables:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

TIP: The verb transmute() adds new variables just like mutate() but also drops/removes all other variables that you do not mention.

Recode a variable

To recode the values in a variable, mutate() is also used.

In this case, we notice that we need to clean the variable “hospital”. There are several incorrect spelling, and many missing values.

table(linelist$hospital, useNA = "always")
## 
##                            Connaught Hopital 
##                                           47 
##                           Connaught Hospital 
##                                         1715 
##                             Military Hopital 
##                                           30 
##                            Military Hospital 
##                                          786 
##                             Mitylira Hopital 
##                                            1 
##                            Mitylira Hospital 
##                                           79 
##                                        other 
##                                          885 
##  Princess Christian Maternity Hopital (PCMH) 
##                                           11 
## Princess Christian Maternity Hospital (PCMH) 
##                                          411 
##                               Rokupa Hopital 
##                                           11 
##                              Rokupa Hospital 
##                                          443 
##                                         <NA> 
##                                         1469

For to change spellings one-by-one, you can use the recode() function within the mutate function. The code is saying that the variable “hospital” should be defined as the current variable “hospital” but with certain changes (syntax is OLD = NEW). Don’t forget commas!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                           # OLD = NEW
                           "Mitylira Hopital"  = "Military Hospital",
                           "Mitylira Hospital" = "Military Hospital",
                           "Military Hopital"  = "Military Hospital",
                           "Connaught Hopital" = "Connaught Hospital",
                           "Rokupa Hopital"    = "Rokupa Hospital",
                           "other"             = "Other",
                           "Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
                           ))

table(linelist$hospital, useNA = "always")
## 
##                           Connaught Hospital 
##                                         1762 
##                            Military Hospital 
##                                          896 
##                                        Other 
##                                          885 
## Princess Christian Maternity Hospital (PCMH) 
##                                          422 
##                              Rokupa Hospital 
##                                          454 
##                                         <NA> 
##                                         1469

TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space inbetween ("").

If you need to use logic statements with multiple variables, or want to use operators like %in%, use case_when() instead, as below. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic!

linelist <- linelist %>% 
  mutate(hospital = case_when(hospital == "Connaught Hopital"                           ~ "Connaught Hospital",
                              hospital == "Rokupa Hopital"                              ~ "Rokupa Hospital", 
                              hospital %in% c("Mitylira Hopital", 
                                             "Mitylira Hospital",
                                             "Mitylira Hospital",
                                             "Military Hopital")                        ~ "Military Hospital",
                              is.na(hospital)                                           ~ "Missing",
                              hospital == "Princess Christian Maternity Hopital (PCMH)" ~ "Princess Christian Maternity Hospital (PCMH)",
                              TRUE                                                      ~  hospital) 
)
table(linelist$hospital, useNA = "always")
## 
##                           Connaught Hospital 
##                                         1762 
##                            Military Hospital 
##                                          896 
##                                      Missing 
##                                         1469 
##                                        Other 
##                                          885 
## Princess Christian Maternity Hospital (PCMH) 
##                                          422 
##                              Rokupa Hospital 
##                                          454 
##                                         <NA> 
##                                            0

To change missing values to a character value, such as “Missing”, use the function replace_na() in the same manner as recode above:

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

Likewise you can quickly convert character values to NA using na_if(), as below:

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

logical re-coding

Intro For simple cases you can use ifelse() or if_else(). In most cases it is better to use case_when().

ifelse() and if_else():
These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if TRUE, value if FALSE). if_else() is a special version from dplyr that handles dates.

Stringing together ifelse statements - NOT ADVISED!! Difficult to read and keep track of.

IMAGE of ifelse string with X across is.

Use case-when() instead.

You can reference other variables with the ifelse() function within mutate():

linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))
linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))

Missing if… na_if() lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),

coalesce()

if_else(), ifelse()

recode CLEANING MISSPELLINGS HOSPITAL NAME

Replace

Using cleaning dictionaries

CAUTION: This tab may follow from previous tabs.

## load cleaning rules and only keep variables in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
  filter(variable %in% c(names(mll_raw), ".global"))

## define variables that are not cleand
unchanged <- c(
  "epilink_relationship",
  "narratives",
  "epilink_relationship_detail"
)

mll_clean <- mll_raw %>%
  ## convert to tibble
  as_tibble() %>%
  ## clean variables using cleaning rules
  clean_data(
    wordlists = mll_cleaning_rules,
    protect = names(.) %in% unchanged
  )

Mutating grouped data

CAUTION: This tab may follow from previous tabs.

Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html

Taken from website above:

Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:

starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:

starwars %>%
  select(name, mass, species) %>%
  group_by(species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.

Using base R

If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new variable by assigning it a value. In the command below, the variable new_var does not exist until after the command is executed. In this simple example the variable is assigned the static value “new value”, so for all rows the value will be “new value”.

linelist_raw$new_var <- "new value"

You can also give the new variable a dyanmic value as shown below, or using the case_when() command explained in the next tab.

# mathematical operators on another variable
linelist_raw$new_var <- linelist_raw$OLD_var * 4

Groups by condition (case_when())

CAUTION: This tab may follow from previous tabs.

TODO tutorial on using case_when()

Numeric groups

For example, creating age groups cut()

case_when()

age_categories() (R4Epis package)

by percentile

WHAT TO DO IF AGE IS SPREAD ACROSS TWO VARAIBLES (e.g. numeric age + unit)

Highest in hierarchy

CAUTION: This tab may follow from previous tabs.

Within a group, indicate/convert to the highest value in the group

Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)